1 Introduction

Project EDIFES now has 779 cleaned building datasets ingested into HBase and approximately 20 working building markers. The purpose of this report is to document preliminary work towards a complete cross-sectional study of all buildings and markers.

1.1 Data Characteristics

  • Total of 779 buildings
  • 209 buildings have square footage readings
      • 92 of these are ‘accurate’ and 117 are estimated
  • 424 buildings have a standardized building type
  • 155 buildings have number-of-floors information

1.2 Markers

Markers constantly change and results presented here reflect markers as of 12/1/2017.

  • Markers run on all buildings:
      • Heating Type
      • EUI (where applicable)
      • Effective Thermal Resistance (where applicable)
      • Base-to-Peak Ratio
      • Summary Statistics
      • Data Quality Check
      • HVAC Size
      • Heating Oversized
      • Cooling Oversized
      • HVAC Schedule
  • Additional analysis covers:
      • Climate zone (KGCZ and ASHRAE) distribution
      • Annual Consumption
      • Correlations between weather conditions and electricity consumption

2 Building Types

Results were obtained via the StandardizeBuildingTypes function written by Shreyas Kamath.

Building Standardized Types
type count
Banking 1
Educational 74
Entertainment 10
Food Sales & Service 34
Healthcare 33
Industrial 32
Office 41
Other 25
Public services 15
Retail 106
Services 25
Skyscraper 15
Storage 7
Utilities 6

Completely pointless wordcloud of building types.

3 Climate Zone Distribution

Results were obtained via functions I wrote for KGCZ (mrk-climate_identifier.R) and ASHRAE (mrk-get_ashrae_cz.R) climate zone identification from latitude and longitude. The KGCZ lookup uses latitude and longitude at 0.5-degree precision. The ASHRAE lookup queries the United States Census Bureau API to retrieve the county and matches it against a list of counties and their climate zones.

3.1 Köppen-Geiger Climate Zones

Image to orient ourselves.

3.1.1 Table of KGCZ climate zone counts

KGCZ
kgcz count
BSh 10
BSk 14
BWh 1
BWk 1
Cfa 424
Cfb 14
Csa 17
Csb 54
Dfa 163
Dfb 78

3.1.2 Plot of KGCZ Distribution

## [1] 0.7535302
  • 75% of buildings are in just 2 climate zones, Cfa and Dfa.
  • 4 climate zones have over 30 buildings (30 is the common rule-of-thumb sample size for the central limit theorem to apply)
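The two figures above can be reproduced from the table in 3.1.1. The following is a minimal Python sketch, not the R code used for this report; the variable names are illustrative, and the choice of 779 (the total from section 1.1) as the denominator is an assumption that happens to match the printed value of 0.7535302.

```python
# Building counts per Köppen-Geiger zone, copied from the table in 3.1.1.
kgcz_counts = {
    "BSh": 10, "BSk": 14, "BWh": 1, "BWk": 1, "Cfa": 424,
    "Cfb": 14, "Csa": 17, "Csb": 54, "Dfa": 163, "Dfb": 78,
}
TOTAL_BUILDINGS = 779  # total from section 1.1 (assumed denominator)

# Two most populated zones and their combined share of all buildings.
top_two = sorted(kgcz_counts, key=kgcz_counts.get, reverse=True)[:2]
top_two_share = sum(kgcz_counts[z] for z in top_two) / TOTAL_BUILDINGS

# Zones with more than 30 buildings.
zones_over_30 = sorted(z for z, n in kgcz_counts.items() if n > 30)

# Number of buildings that appear in the table at all.
classified = sum(kgcz_counts.values())

print(top_two, round(top_two_share, 7))
print(zones_over_30)
print(classified)
```

Note that the table counts sum to 776 rather than 779, which suggests that three buildings have no KGCZ assigned; this is an inference from the numbers, not something stated in the report.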

3.2 ASHRAE Climate Zones

Image to orient ourselves

3.2.1 Table of ASHRAE Climate Zones

ASHRAE CZ
a_cz count
2A 5
2B 2
3A 2
3B 32
3C 58
4A 425
4C 6
5A 242
5B 1

3.2.2 Plot of ASHRAE Climate Zone Distributions

  • 85.6% of buildings are in two climate zones, 4A and 5A
  • 4 climate zones with more than 30 buildings

4 Data Quality

The data quality check currently does not work for 35 datasets (all those with 1-minute interval data).

4.1 Data Quality standards

4.2 Plot of All Quality before and after

4.3 Data Quality Changes through Cleaning

This shows all changes with more than 5 occurrences.

4.4 Data Quality by Sample Set

Now, to highlight the datasets that were not AAAP.

Non AAAP Quality before Cleaning
Sampleset AAAF AABF AABP AACP AADP BAAF BAAP BADP CAAP CADP DAAP
sampleset10 0 0 0 0 0 0 1 0 0 0 0
sampleset2 1 1 5 0 2 0 1 0 1 0 2
sampleset3 1 0 2 0 1 0 30 2 28 1 6
sampleset4 0 0 1 0 2 0 0 0 0 0 2
sampleset5 2 0 1 0 0 1 0 0 0 0 0
sampleset6 0 0 0 0 0 0 0 0 0 0 1
sampleset7 0 0 0 0 0 0 0 0 0 0 0
sampleset8 0 0 0 0 0 0 11 0 18 0 0
sampleset9 0 0 0 1 6 0 0 0 0 0 1
Non AAAP Quality after Cleaning
Sampleset AAAF BAAF BAAP CAAF CAAP DAAF
sampleset10 0 1 0 0 0 0
sampleset2 4 0 0 0 0 0
sampleset3 18 5 4 5 2 7
sampleset4 2 0 0 0 0 1
sampleset5 2 1 0 0 0 0
sampleset6 1 0 0 0 0 0
sampleset7 0 0 0 0 0 0
sampleset8 7 4 2 6 2 2
sampleset9 1 0 0 0 0 1

4.5 Plots of data quality by sample set

5 Heating Type

Heating Type by Location

In the following map, the Starbucks locations are shown as stars. These locations behave considerably differently from the other buildings, which might explain why they are classified as non-electrical heating despite being located in the Southwest.

5.1 Heating Type Determination

The current process to determine the heating type is as follows

  • Data subset to business days between 7 am and 7 pm
  • Linear model between energy use and temperature
  • Slope is extracted from the model
  • Cut point is set at a slope of 0
  • Negative slope indicates electrical heating
  • Positive slope indicates non-electrical (gas or no?) heating
  • Winter and summer temperatures determined from changepoint analysis
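The steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the marker's actual R implementation: the function names are hypothetical, weekdays stand in for business days (no holiday calendar), the window is taken as 07:00 up to but not including 19:00, and the changepoint step for winter and summer temperatures is omitted.

```python
from datetime import datetime

def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def classify_heating(readings, cut_point=0.0):
    """Classify heating type from (timestamp, temperature, energy) tuples.

    Assumption: weekdays approximate business days, and the 7 am to 7 pm
    window keeps hours 7 through 18.
    """
    subset = [(temp, energy) for ts, temp, energy in readings
              if ts.weekday() < 5 and 7 <= ts.hour < 19]
    slope = ols_slope([t for t, _ in subset], [e for _, e in subset])
    # A slope of exactly 0 is not covered by the rules above; this sketch
    # places it in the non-electrical group.
    label = "electrical" if slope < cut_point else "non-electrical"
    return slope, label

# Synthetic Monday (2017-01-02): energy falls as temperature rises.
readings = [(datetime(2017, 1, 2, hour), 2.0 * hour, 100.0 - 3.0 * hour)
            for hour in range(24)]
slope, label = classify_heating(readings)
print(round(slope, 3), label)
```

Keeping `cut_point` as a parameter makes it easy to experiment with thresholds other than 0.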

Example plot of heating type. Conclusion from this plot is electrical heating.

Everything to the right of the black vertical line in the following plot is classified as non-electrical heating while everything to the left is classified as electrical.

We can restrict the plot to slopes between -0.1 and 0.1 because the majority of slopes fall in that range.

The question is where to draw the line for electrical heating. Currently the cut point is at a slope of 0, but this might need to be adjusted or we need to use a different method.

Thoughts?

6 Annual Consumption, Energy Use Intensity, Effective R Values

6.1 Annual Consumption in kWh

This is annual consumption for the most recent year.

6.1.1 Summary Statistics

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 6.118e+03 1.522e+06 2.349e+06 1.570e+07 9.013e+06 1.446e+09         9

6.1.2 Histogram of Annual Energy Use

The red vertical line is the median at 2.349e6 kWh per year.

Energy Use Intensity

201 buildings have an energy use intensity value.

EUI Stats by Building Type
buts count mean median sd reference
Educational 28 2139.41 45.28 7933.45 73.10000
Entertainment 1 60.95 60.95 NaN 44.78750
Food Sales & Service 19 275.95 238.76 98.73 229.85000
Healthcare 10 330.93 123.85 465.56 97.92857
Industrial 20 1547.27 294.58 5470.15 NaN
Office 37 144.53 51.84 238.30 NA
Public services 2 78.91 78.91 4.22 NA
Retail 67 99.48 52.95 163.18 103.18333
Skyscraper 3 48.41 49.77 13.79 NaN
Storage 6 109.48 96.18 69.76 58.20000
Utilities 3 3545.07 2960.05 3557.68 40.69000

A good visualization for the distribution of EUI values is a boxplot. I filtered out EUI values greater than 500, which are highly suspect.

6.2 Effective Thermal Resistance

To give a sense of context, vacuum-sealed panels, the top-of-the-line insulation, have an effective R-value of 50 hr F ft^2 / BTU.

In his thesis, Aaron cites a paper (Nordström et al. 2013[^1]) that examined R-values from 6 residential buildings in Sweden built from the 1960s to 2006 to validate the results he obtained from his function. The paper reports R-values between 9.1 and 23.7 hr F ft^2 / BTU.

[^1] G. Nordström, H. Johnsson, and S. Lidelöw, “Using the Energy Signature Method to Estimate the Effective U-Value of Buildings,” in Sustainability in Energy and Buildings, Springer, Berlin, Heidelberg, 2013, pp. 35–44.

Effective Thermal Resistance Stats by Building Type
buts count mean median sd
Educational 28 387.60 222.95 514.17
Entertainment 1 176.54 176.54 NaN
Food Sales & Service 19 5.62 5.58 2.21
Healthcare 10 68.08 56.92 53.66
Industrial 20 43.70 31.06 61.51
None 8 14.90 3.37 29.45
Office 40 137.66 27.51 262.82
Public services 2 111.10 111.10 54.98
Retail 67 234.10 268.99 130.94
Skyscraper 3 264.32 250.00 252.83
Storage 6 117.12 89.58 80.77
Utilities 3 10.45 0.95 17.05

Boxplots of Effective Thermal Resistance. The red vertical lines indicate the typical range as reported in the paper.

This function needs some work, and I plan on addressing it over winter break. It is based on a thermodynamic model as documented by Aaron in his paper. Professor Abramson has validated the method, but the implementation might need an adjustment. Any ideas would be appreciated.

7 Weather Correlations

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables. The following plots show the Pearson correlation coefficient between weather variables and energy consumption.
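For reference, the coefficient can be computed directly from its definition. The snippet below is a small Python sketch with made-up temperature and consumption values chosen only to illustrate the calculation; it is not the code that produced the plots in this report.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only: consumption falls linearly as temperature rises.
temperature = [10, 15, 20, 25, 30]
consumption = [120, 110, 100, 90, 80]
r = pearson_r(temperature, consumption)
print(r)
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship; the perfectly linear toy data here gives a value at the negative extreme.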

7.1 Boxplots of Correlations

Heatmaps for Climate Zones and Weather Variables

The following heatmaps show the average correlations between weather conditions and energy consumption by climate zone. The dendrograms cluster similar weather conditions and similar climate zones.

Base Peak Ratio

The base to peak ratio is the average base load divided by the average peak load. This marker is segmented by winter and summer and by year so we can look at changes between the seasons as well as changes over the years.

  • Base peak ratio > 0.30 indicates an opportunity for savings by reducing the base load.

We can first look at the base to peak ratio statistics for the final year by sample set. These tables are grouped by season and arranged from lowest (best) to highest (worst) base to peak ratio.

  • pct_savings indicates the percentage of buildings in the sample set that can save by reducing baseload.
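The ratio and the pct_savings column can be sketched as below. This is a hedged Python illustration: the report does not state how base and peak loads are extracted from the meter data, so the per-day base and peak values here are hypothetical inputs, and the function names are illustrative. Note that pct_savings in the tables is expressed as a fraction between 0 and 1.

```python
def base_to_peak_ratio(base_loads, peak_loads):
    """Average base load divided by average peak load."""
    mean_base = sum(base_loads) / len(base_loads)
    mean_peak = sum(peak_loads) / len(peak_loads)
    return mean_base / mean_peak

def pct_savings(ratios, threshold=0.30):
    """Fraction of buildings whose ratio exceeds the savings threshold."""
    return sum(1 for r in ratios if r > threshold) / len(ratios)

# Hypothetical daily base and peak loads (kW) for one building.
ratio = base_to_peak_ratio([20, 22, 18], [80, 90, 70])

# Hypothetical ratios for four buildings in one sample set.
share = pct_savings([0.25, 0.46, 0.56, 0.71])
print(ratio, share)
```

In this toy example the single building falls below the 0.30 threshold, while three of the four buildings in the hypothetical sample set exceed it.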
Winter Base Load Stats by Sample Set
styp season count mean median sd pct_savings
sampleset7 winter 19 0.2657895 0.250 0.0623891 0.2105263
sampleset10 winter 2 0.2950000 0.295 0.0212132 0.5000000
sampleset5 winter 8 0.4325000 0.390 0.1661110 0.8750000
sampleset3 winter 169 0.5182840 0.460 0.1991410 0.9112426
sampleset8 winter 30 0.5310000 0.510 0.2310299 0.8000000
sampleset6 winter 14 0.5585714 0.550 0.1037643 1.0000000
sampleset2 winter 357 0.5904482 0.560 0.1992030 0.9551821
sampleset9 winter 37 0.6183784 0.660 0.1815598 0.9729730
sampleset4 winter 137 0.6649635 0.710 0.2002209 0.9124088
Summer Base Load Stats by Sample Set
styp season count mean median sd pct_savings
sampleset7 summer 19 0.2878947 0.270 0.0686034 0.3157895
sampleset5 summer 7 0.3242857 0.330 0.0877225 0.7142857
sampleset10 summer 2 0.3300000 0.330 0.1555635 0.5000000
sampleset3 summer 169 0.4682249 0.410 0.2236237 0.7869822
sampleset6 summer 14 0.4835714 0.425 0.1726093 0.9285714
sampleset8 summer 30 0.4923333 0.425 0.2430791 0.7666667
sampleset2 summer 357 0.5435854 0.510 0.2199234 0.8711485
sampleset9 summer 37 0.5929730 0.620 0.2086713 0.8918919
sampleset4 summer 137 0.6018248 0.630 0.2133770 0.8978102

We can also look at boxplots for each sampleset. The blue vertical line indicates the threshold established for savings opportunities.

As a sanity check, we can look at a plot showing the relationship between the ratio during the summer and winter. We would expect this to be a positive linear relationship.

7.2 Yearly Changes in Ratio

The base to peak ratio is calculated for each year, so we can look at the changes over the years to see which buildings are improving.

  • Change is defined as oldest ratio - most recent ratio
  • Positive change indicates reduction in ratio
  • Calculated only for buildings with more than one year of base load data
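The change definition above can be written out as a short Python sketch. The function name and the year-to-ratio dictionary are illustrative assumptions, not the marker's actual data structure.

```python
def ratio_change(ratios_by_year):
    """Oldest ratio minus most recent ratio for one building.

    ratios_by_year maps year -> base-to-peak ratio. Returns None when the
    building has fewer than two years of data, mirroring the rule above.
    """
    if len(ratios_by_year) < 2:
        return None
    years = sorted(ratios_by_year)
    return ratios_by_year[years[0]] - ratios_by_year[years[-1]]

# Hypothetical building whose ratio drops over three years (an improvement).
change = ratio_change({2014: 0.60, 2015: 0.55, 2016: 0.50})
print(change)

# A building with a single year of data is skipped.
print(ratio_change({2016: 0.45}))
```

A positive result means the ratio fell between the oldest and the most recent year; intermediate years do not affect the value.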

Most buildings change relatively little, if at all, over the years. Again, we can compare seasons to see if there is a correlation between the summer change and the winter change in base peak ratio.

Ideally, a building would be in the upper right quadrant, with positive changes in both summer and winter.

Improvement percent
styp season imp_pct
sampleset10 summer 0.5000000
sampleset10 winter 0.5000000
sampleset2 summer 0.3878788
sampleset2 winter 0.3878788
sampleset3 summer 0.4750000
sampleset3 winter 0.4750000
sampleset4 summer 0.4104478
sampleset4 winter 0.3955224
sampleset5 summer 0.6000000
sampleset5 winter 0.8000000
sampleset6 summer 0.4615385
sampleset6 winter 0.0769231
sampleset7 summer 0.7333333
sampleset7 winter 0.6000000
sampleset8 summer 0.5333333
sampleset8 winter 0.5000000
sampleset9 summer 0.2500000
sampleset9 winter 0.3888889

7.2.1 Base Peak Ratio Conclusions

The vast majority of buildings have base to peak ratio savings opportunities based on the threshold of 0.3. As expected, there is a positive linear relationship between the ratio during the summer and winter, providing a sanity check for our calculation.

Among buildings that show a change in ratio, the change is roughly evenly split between improvement and worsening.

8 HVAC Schedule

The HVAC schedule function finds the most likely turn on and turn off times for business and non-business days.

First, we can look at business day turn on and turn off times by sample set.

One other thing to look at is typical length of operating day.

Average HVAC Schedule by Sample Set
styp mean_on mean_off hours
sampleset10 5.000 16.500 11.500
sampleset2 8.854 19.426 10.572
sampleset3 8.696 19.233 10.537
sampleset4 7.422 18.341 10.919
sampleset5 5.321 17.929 12.607
sampleset6 5.808 16.769 10.962
sampleset7 4.062 21.969 17.906
sampleset8 5.513 18.363 12.850
sampleset9 6.837 16.772 9.935

9 Correlation Plots

We can make some correlation plots to determine relationships that exist between building markers. The quantitative numbers can also be printed to look at the possible trends.

Correlation Matrix
eui r summer_ratio winter_ratio log_annc hours
eui 1.0000000 -0.3821435 0.3033090 0.2096193 0.1406890 -0.0036621
r -0.3821435 1.0000000 -0.2657449 -0.3014077 -0.2002057 -0.0561172
summer_ratio 0.3033090 -0.2657449 1.0000000 0.8520375 0.4474388 -0.2653689
winter_ratio 0.2096193 -0.3014077 0.8520375 1.0000000 0.5802950 -0.2566150
log_annc 0.1406890 -0.2002057 0.4474388 0.5802950 1.0000000 -0.1513398
hours -0.0036621 -0.0561172 -0.2653689 -0.2566150 -0.1513398 1.0000000

Another good option is to make pairwise plots. The diagonals show the distribution of the variable, and in the second plot, the asterisks indicate the significance of the relationship.

10 Conclusions

11 Prediction Method

In order to meet ARPA-E milestone 4.1.1, we need to develop a predictive model that achieves an adjusted R^2 greater than 0.85 when predicting six months of consumption. I wanted to test a random forest regression model for its predictive capability. The details of the random forest are presented below; in summary, it is an extremely powerful model that maintains a level of interpretability.

11.1 Random Forest Description

The original paper describing Random Forests is by Leo Breiman.

To understand the random forest, you first need to grasp the concept of a decision tree. A single decision tree is best described as a flowchart of questions about the variable values of an observation that leads to a classification or prediction. Each question (known as a node) has a yes/no answer based on the value of a particular variable, and the two answers form branches leading away from the node. Eventually, the tree terminates in a final classification or prediction node called a leaf. A single decision tree can be arbitrarily large and deep depending on the number of features and the number of classes. Decision trees are adept at both classification and regression and can learn a non-linear decision boundary (they actually learn many small linear decision boundaries which collectively are non-linear). However, a single decision tree is very prone to overfitting, especially as the depth increases: the tree is flexible enough that it tends to simply memorize the training data.

To solve this problem, ensembles of decision trees are combined into a powerful model known as a random forest. Each tree in the forest is trained on a randomly chosen subset of the training data (either with replacement, called bootstrapping, or without) and on a random subset of the features. This increases variability between trees, making the overall forest more robust and less prone to overfitting. To make a prediction, the random forest passes the features (values of variables) of the observation to all trees and averages their outputs for regression, or takes a majority vote for classification. Training each tree on a bootstrap sample and aggregating the results in this way is known as bagging. The random forest can also weight the votes of each tree with respect to the confidence the tree has in its prediction. Overall, the random forest is fast, relatively simple, has a moderate level of interpretability, and performs extremely well on both classification and regression tasks.

The random forest should be one of the first models tried on any machine learning problem and is generally my second approach after a linear model. A number of hyperparameters must be specified for the forest ahead of time, the most important being the number of trees in the forest, the number of features considered by each tree, the maximum depth of each tree, and the minimum number of observations permitted at each leaf. These can be selected by training many different models with varying hyperparameters and selecting the combination that performs best on cross-validation or a testing set. A random forest performs implicit feature selection and can return the relative importances of the features, so it can be used as a method to reduce dimensions for additional algorithms.
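To make the bootstrap-and-average idea concrete, here is a deliberately small Python sketch. It is not a full random forest and not the model used in this report: each "tree" is a single-split regression stump on one feature, so the random feature subsetting described above does not apply, and all names and data are illustrative.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left, right)."""
    best = None
    # Candidate thresholds: every distinct x except the largest, so both
    # sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Toy data: consumption jumps from 10 to 50 once x reaches 10.
xs = list(range(20))
ys = [10.0 if x < 10 else 50.0 for x in xs]
forest = fit_bagged_stumps(xs, ys)
pred_low = predict_forest(forest, 2)
pred_high = predict_forest(forest, 17)
print(pred_low, pred_high)
```

Because every stump sees a different bootstrap sample, the split points vary slightly between stumps, and averaging smooths out the individual differences.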

A simplified model of a decision tree used for exactly this task is presented below.

11.2 Methodology

In order to test the accuracy of the method, I trained the model on all data except for the final six months. I then took the final six months of data and made predictions for the electricity consumption. These predictions were compared to the known true values to assess the predictive capabilities of the random forest. This procedure was then completed for all buildings in HBase.
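The evaluation procedure can be sketched in Python as follows. This is an illustration under assumptions, not the code run for this report: the holdout is taken as the last six distinct calendar months, the data layout of (year, month, value) rows is hypothetical, and the helper names are illustrative. The adjusted R^2 formula is the standard one, 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
def split_final_months(series, holdout_months=6):
    """Split (year, month, value) rows into train and test sets.

    The test set holds the last `holdout_months` distinct calendar months.
    """
    months = sorted({(year, month) for year, month, _ in series})
    holdout = set(months[-holdout_months:])
    train = [row for row in series if (row[0], row[1]) not in holdout]
    test = [row for row in series if (row[0], row[1]) in holdout]
    return train, test

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalise R^2 for the number of features used by the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Hypothetical two years of monthly consumption values.
series = [(2016 + i // 12, i % 12 + 1, 100.0 + i) for i in range(24)]
train, test = split_final_months(series)

# Hypothetical true values and predictions for a four-point holdout.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
adj = adjusted_r_squared(r2, n_obs=4, n_features=1)
print(len(train), len(test), r2, adj)
```

Splitting by time rather than at random keeps future observations out of the training set, which matches the forecasting use case.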

11.3 Results

11.3.1 Typical Predictions

The following are predictions made for the Progressive APS building in Phoenix, Arizona. The R^2 value for these predictions was 0.933.

12 Animations

Animated map (map.gif), assembled from 39 frames (Rplot1.png to Rplot39.png).